The 2019-2020 Coronavirus Pandemic Analysis

Contact: Smith Research

BACKGROUND & APPROACH

I wanted to track and trend the coronavirus outbreak out of my own curiosity. Some interesting questions may fall out of this, as it is a historic moment both scientifically and analytically: we have a large amount of data being shared across the globe and analyzed in real time. The world has come to a halt because of it.
This analysis attempts to answer the following questions (more to come):

  1. What does the trend of the pandemic look like to date?
  2. What are future case predictions based on a historical model?
  3. What interesting quirks or patterns emerge?

ASSUMPTIONS & LIMITATIONS: * This data is limited by its source. I realized early on that the number of cases conflicted depending on the source. Originally I was using JHU data… but it was always ‘ahead’ of Our World In Data. I also noticed that JHU’s website was buggy: you clicked on the U.S. stats but it didn’t reflect the U.S. So I changed data sources to be more consistent with what is presented in the media (and Our World In Data has more extensive plots I can compare my own to). An interesting aside might be: why the discrepancy? Was I missing something?
* Definitions are important, as is the idea that multiple variables accumulate in quantities like total cases (more testing, for example).

SOURCE RAW DATA: * https://ourworldindata.org/coronavirus
* https://github.com/CSSEGISandData/COVID-19/

INPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/data)

OUTPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/results)

TIMESTAMP

Start: ##—— Wed Jun 24 22:40:52 2020 ——##

PRE-ANALYSIS

The following sections are outside the scope of the ‘analysis’ proper but are still needed to prepare everything.

UPSTREAM PROCESSING/ANALYSIS

  1. Google Mobility Scraping, script available at get_google_mobility.py
# Mobility data has to be extracted from Google PDF reports using a web-scraping script (Python, written by Peter Simone, https://github.com/petersim1/MIT_COVID19)

# See get_google_mobility.py for local script 

python3 get_google_mobility.py
# writes csv file of mobility data as "mobility.csv"

SET UP ENVIRONMENT

Load libraries and set global variables

# timestamp start
timestamp()
## ##------ Wed Jun 24 22:40:52 2020 ------##

# clear previous environment
rm(list = ls())

##------------------------------------------
## LIBRARIES
##------------------------------------------
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plot.utils)
library(utils)
library(knitr)

##------------------------------------------

##------------------------------------------
# GLOBAL VARIABLES
##------------------------------------------
user_name <- Sys.info()["user"]
working_dir <- paste0("/Users/", user_name, "/Projects/coronavirus/")  # don't forget trailing /
results_dir <- paste0(working_dir, "results/")  # assumes directory exists
results_dir_custom <- paste0(results_dir, "custom/")  # assumes directory exists


Corona_Cases.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
Corona_Cases.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
Corona_Deaths.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
Corona_Deaths.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

Corona_Cases.fn <- paste0(working_dir, "data/", basename(Corona_Cases.source_url))
Corona_Cases.US.fn <- paste0(working_dir, "data/", basename(Corona_Cases.US.source_url))
Corona_Deaths.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.source_url))
Corona_Deaths.US.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.US.source_url))
default_theme <- theme_bw() + theme(text = element_text(size = 14))  # fix this
##------------------------------------------

FUNCTIONS

List of functions

function_name description
prediction_model outputs case estimates for given log-linear model parameters (slope and intercept)
make_long converts input data to long format (specialized cases)
name_overlaps outputs the column name intersection and set differences of two data frames
find_linear_index finds the first date at which linearity occurs
##------------------------------------------
## FUNCTION: prediction_model
##------------------------------------------
## --- //// ----
# Takes days vs log10 (case) linear model parameters and a set of days since 100 cases and outputs a dataframe with total number of predicted cases for those days
## --- //// ----
prediction_model<-function(m=1,b=0,days=1){
  total_cases.log<-m*days+b        # model predicts log10(cases)
  total_cases<-10^total_cases.log  # back-transform to case counts
  prediction<-data.frame(days=days,Total_confirmed_cases_perstate=total_cases)
  return(prediction)
}
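
A minimal usage sketch (the slope and intercept here are hypothetical placeholders, not fitted values):

```r
# Hypothetical log-linear parameters: slope m and intercept b as would come
# from lm(log10(cases) ~ days_since_100); days is a vector of day offsets
pred <- prediction_model(m = 0.1, b = 2, days = 0:10)
head(pred)
```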
##------------------------------------------

##------------------------------------------
## FUNCTION: make_long
##------------------------------------------
## --- //// ----
# Takes wide-format case data and converts it into long format, using date and total cases as variable/value. Also enforces standardization (assumes the data structure's naming) by using a fixed variable name, value name, and id.vars.
## --- //// ----
make_long<-function(data_in, variable.name = "Date",
                    value.name = "Total_confirmed_cases",
                    id.vars = c("case_type","Province.State","Country.Region","Lat","Long","City","Population")){
  long_data<-melt(data_in,
                  id.vars = id.vars,
                  variable.name = variable.name,
                  value.name = value.name)
  return(long_data)
}
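
A toy example of the reshape (assumes `reshape2` is loaded and `make_long` is defined as above; the data values are made up):

```r
toy <- data.frame(case_type = "total", Province.State = "X", Country.Region = "Y",
                  Lat = 0, Long = 0, City = NA, Population = NA,
                  X1.22.20 = 1, X1.23.20 = 3)
make_long(toy)
# Each date column becomes a row, with Date holding the column name
# (e.g. "X1.22.20") and Total_confirmed_cases holding the count
```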
##------------------------------------------

## THIS WILL BE IN UTILS AT SOME POINT
name_overlaps<-function(df1,df2){
  i<-intersect(names(df1),names(df2))
  sd1<-setdiff(names(df1),names(df2))
  sd2<-setdiff(names(df2),names(df1))
  cat("intersection:\n",paste(i,"\n"))
  cat("in df1 but not df2:\n",paste(sd1,"\n"))
  cat("in df2 but not df1:\n",paste(sd2,"\n"))
  return(list("int"=i,"sd_1_2"=sd1,"sd_2_1"=sd2))
}
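A quick check of the overlap helper on two tiny frames that share only column "a":

```r
# df1 has columns a,b; df2 has columns a,c
o <- name_overlaps(data.frame(a = 1, b = 2), data.frame(a = 1, c = 3))
o$int     # "a"
o$sd_1_2  # "b"
o$sd_2_1  # "c"
```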

##------------------------------------------

##------------------------------------------
## FUNCTION: find_linear_index
##------------------------------------------
## --- //// ----
# Find date at which total case data is linear (for a given data frame) 
## --- //// ----

find_linear_index<-function(tmp,running_avg=5){
  tmp$Total_confirmed_cases_perstate.log<-log(tmp$Total_confirmed_cases_perstate,2)
  derivative<-data.frame(matrix(nrow = nrow(tmp),ncol = 4))
  names(derivative)<-c("m.time","mm.time","cumsum","date")
  
  # First derivative
  for(t in 2:nrow(tmp)){
    slope.t<- tmp[t,"Total_confirmed_cases_perstate.log"]- tmp[t-1,"Total_confirmed_cases_perstate.log"]
    derivative[t,"m.time"]<-slope.t
    derivative[t,"date"]<-as.Date(tmp[t,"Date"])
  }
  
  # Second derivative
  for(t in 2:nrow(derivative)){
    slope.t<- derivative[t,"m.time"]- derivative[t-1,"m.time"]
    derivative[t,"mm.time"]<-slope.t
  }
  
  # Count how many of the last `running_avg` second-derivative values are within 0.2 of zero
  for(t in running_avg:nrow(derivative)){
    slope.t<- sum(abs(derivative[t:(t-running_avg+1),"mm.time"])<0.2,na.rm = T)
    derivative[t,"cumsum"]<-slope.t
  }
  
  # Find the date `running_avg` days before the stability point
  linear_begin<-min(derivative[!is.na(derivative$cumsum) & derivative$cumsum==running_avg,"date"])-running_avg
  
  return(linear_begin)
}
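
A synthetic sketch of how this is intended to be called (toy doubling data, so the claim about the return type is the only assertion made):

```r
# Cases that double every day: log2(cases) is linear in time, so the second
# derivative of log2(cases) is ~0 and the stability criterion is met early
dates <- seq(as.Date("2020-03-01"), by = "day", length.out = 20)
tmp <- data.frame(Date = dates, Total_confirmed_cases_perstate = 2^(1:20))
find_linear_index(tmp)  # numeric date (days since 1970-01-01)
```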

READ IN DATA

# Q: do we want to archive previous versions? Maybe an auto git mv?

##------------------------------------------
## Download and read in latest data from github
##------------------------------------------
download.file(Corona_Cases.source_url, destfile = Corona_Cases.fn)
Corona_Totals.raw <- read.csv(Corona_Cases.fn, header = T, stringsAsFactors = F)

download.file(Corona_Cases.US.source_url, destfile = Corona_Cases.US.fn)
Corona_Totals.US.raw <- read.csv(Corona_Cases.US.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.source_url, destfile = Corona_Deaths.fn)
Corona_Deaths.raw <- read.csv(Corona_Deaths.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.US.source_url, destfile = Corona_Deaths.US.fn)
Corona_Deaths.US.raw <- read.csv(Corona_Deaths.US.fn, header = T, stringsAsFactors = F)

# latest date on all data:
paste("US deaths:", names(Corona_Deaths.US.raw)[ncol(Corona_Deaths.US.raw)])
## [1] "US deaths: X6.23.20"
paste("US total:", names(Corona_Totals.US.raw)[ncol(Corona_Totals.US.raw)])
## [1] "US total: X6.23.20"
paste("World deaths:", names(Corona_Deaths.raw)[ncol(Corona_Deaths.raw)])
## [1] "World deaths: X6.23.20"
paste("World total:", names(Corona_Totals.raw)[ncol(Corona_Totals.raw)])
## [1] "World total: X6.23.20"

PROCESS DATA

  • Convert to long format
  • Fix date formatting/convert to numeric date
  • Log10 transform total # cases
##------------------------------------------
## Combine death and total data frames
##------------------------------------------
Corona_Totals.raw$case_type<-"total"
Corona_Totals.US.raw$case_type<-"total"
Corona_Deaths.raw$case_type<-"death"
Corona_Deaths.US.raw$case_type<-"death"

# For some reason, Population is listed in the US death file but not in the other data. When combining, all datasets will have this column, but US deaths is the only useful one.
Corona_Totals.US.raw$Population<-"NA" 
Corona_Totals.raw$Population<-"NA"
Corona_Deaths.raw$Population<-"NA"

Corona_Cases.raw<-rbind(Corona_Totals.raw,Corona_Deaths.raw)
Corona_Cases.US.raw<-rbind(Corona_Totals.US.raw,Corona_Deaths.US.raw)
#TODO: custom utils- setdiff, intersect names... option to output in merging too
##------------------------------------------
# prepare raw datasets for eventual combining
##------------------------------------------
Corona_Cases.raw$City<-"NA" # only the US-level data has Cities; add a placeholder column to the world data
Corona_Cases.US.raw$Country_Region<-"US_state" # To differentiate from World-level stats

Corona_Cases.US.raw<-plyr::rename(Corona_Cases.US.raw,c("Province_State"="Province.State",
                                                  "Country_Region"="Country.Region",
                                                  "Long_"="Long",
                                                  "Admin2"="City"))


##------------------------------------------
## Convert to long format
##------------------------------------------
#JHU has a gross file format. It's in wide format, with each column being a date in MM/DD/YY. So read this in as raw data but transform it to be better suited for analysis.
# Furthermore, the World- and US-level data are formatted differently, containing different columns, etc. Rectify this and combine the world-level stats with the U.S.-level stats.

Corona_Cases.long<-rbind(make_long(select(Corona_Cases.US.raw,-c(UID,iso2,iso3,code3,FIPS,Combined_Key))),
make_long(Corona_Cases.raw))


##------------------------------------------
## Fix date formatting, convert to numeric date
##------------------------------------------
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "^X",replacement = "0") # R prefixes date column names with "X"; replace with a leading 0
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "20$",replacement = "2020") # year ends in .20, not .2020
Corona_Cases.long$Date<-as.Date(Corona_Cases.long$Date,format = "%m.%d.%y")
Corona_Cases.long$Date.numeric<-as.numeric(Corona_Cases.long$Date)
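
The two substitutions can be traced on a single raw column name (hypothetical value of the form JHU produces):

```r
d <- "X6.23.20"
d <- gsub(d, pattern = "^X", replacement = "0")     # "06.23.20"
d <- gsub(d, pattern = "20$", replacement = "2020") # "06.23.2020"
as.Date(d, format = "%m.%d.%y")
```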

kable(table(select(Corona_Cases.long,c("Country.Region","case_type"))),caption = "Number of death and total case longitudinal datapoints per geographical region")
Number of death and total case longitudinal datapoints per geographical region
death total
Afghanistan 154 154
Albania 154 154
Algeria 154 154
Andorra 154 154
Angola 154 154
Antigua and Barbuda 154 154
Argentina 154 154
Armenia 154 154
Australia 1232 1232
Austria 154 154
Azerbaijan 154 154
Bahamas 154 154
Bahrain 154 154
Bangladesh 154 154
Barbados 154 154
Belarus 154 154
Belgium 154 154
Belize 154 154
Benin 154 154
Bhutan 154 154
Bolivia 154 154
Bosnia and Herzegovina 154 154
Botswana 154 154
Brazil 154 154
Brunei 154 154
Bulgaria 154 154
Burkina Faso 154 154
Burma 154 154
Burundi 154 154
Cabo Verde 154 154
Cambodia 154 154
Cameroon 154 154
Canada 2156 2156
Central African Republic 154 154
Chad 154 154
Chile 154 154
China 5082 5082
Colombia 154 154
Comoros 154 154
Congo (Brazzaville) 154 154
Congo (Kinshasa) 154 154
Costa Rica 154 154
Cote d’Ivoire 154 154
Croatia 154 154
Cuba 154 154
Cyprus 154 154
Czechia 154 154
Denmark 462 462
Diamond Princess 154 154
Djibouti 154 154
Dominica 154 154
Dominican Republic 154 154
Ecuador 154 154
Egypt 154 154
El Salvador 154 154
Equatorial Guinea 154 154
Eritrea 154 154
Estonia 154 154
Eswatini 154 154
Ethiopia 154 154
Fiji 154 154
Finland 154 154
France 1694 1694
Gabon 154 154
Gambia 154 154
Georgia 154 154
Germany 154 154
Ghana 154 154
Greece 154 154
Grenada 154 154
Guatemala 154 154
Guinea 154 154
Guinea-Bissau 154 154
Guyana 154 154
Haiti 154 154
Holy See 154 154
Honduras 154 154
Hungary 154 154
Iceland 154 154
India 154 154
Indonesia 154 154
Iran 154 154
Iraq 154 154
Ireland 154 154
Israel 154 154
Italy 154 154
Jamaica 154 154
Japan 154 154
Jordan 154 154
Kazakhstan 154 154
Kenya 154 154
Korea, South 154 154
Kosovo 154 154
Kuwait 154 154
Kyrgyzstan 154 154
Laos 154 154
Latvia 154 154
Lebanon 154 154
Lesotho 154 154
Liberia 154 154
Libya 154 154
Liechtenstein 154 154
Lithuania 154 154
Luxembourg 154 154
Madagascar 154 154
Malawi 154 154
Malaysia 154 154
Maldives 154 154
Mali 154 154
Malta 154 154
Mauritania 154 154
Mauritius 154 154
Mexico 154 154
Moldova 154 154
Monaco 154 154
Mongolia 154 154
Montenegro 154 154
Morocco 154 154
Mozambique 154 154
MS Zaandam 154 154
Namibia 154 154
Nepal 154 154
Netherlands 770 770
New Zealand 154 154
Nicaragua 154 154
Niger 154 154
Nigeria 154 154
North Macedonia 154 154
Norway 154 154
Oman 154 154
Pakistan 154 154
Panama 154 154
Papua New Guinea 154 154
Paraguay 154 154
Peru 154 154
Philippines 154 154
Poland 154 154
Portugal 154 154
Qatar 154 154
Romania 154 154
Russia 154 154
Rwanda 154 154
Saint Kitts and Nevis 154 154
Saint Lucia 154 154
Saint Vincent and the Grenadines 154 154
San Marino 154 154
Sao Tome and Principe 154 154
Saudi Arabia 154 154
Senegal 154 154
Serbia 154 154
Seychelles 154 154
Sierra Leone 154 154
Singapore 154 154
Slovakia 154 154
Slovenia 154 154
Somalia 154 154
South Africa 154 154
South Sudan 154 154
Spain 154 154
Sri Lanka 154 154
Sudan 154 154
Suriname 154 154
Sweden 154 154
Switzerland 154 154
Syria 154 154
Taiwan* 154 154
Tajikistan 154 154
Tanzania 154 154
Thailand 154 154
Timor-Leste 154 154
Togo 154 154
Trinidad and Tobago 154 154
Tunisia 154 154
Turkey 154 154
Uganda 154 154
Ukraine 154 154
United Arab Emirates 154 154
United Kingdom 1694 1694
Uruguay 154 154
US 154 154
US_state 502194 502194
Uzbekistan 154 154
Venezuela 154 154
Vietnam 154 154
West Bank and Gaza 154 154
Western Sahara 154 154
Yemen 154 154
Zambia 154 154
Zimbabwe 154 154
# Decouple population and lat/long data, refactor to make it more tidy
metadata_columns<-c("Lat","Long","Population")
metadata<-unique(select(filter(Corona_Cases.long,case_type=="death"),c("Country.Region","Province.State","City",all_of(metadata_columns))))
Corona_Cases.long<-select(Corona_Cases.long,-all_of(metadata_columns))

# Some countries are not summarized at the country level. Collapse all but the US.
Corona_Cases.long<-rbind.fill(ddply(filter(Corona_Cases.long,!Country.Region=="US_state"),c("case_type","Country.Region","Date","Date.numeric"),summarise,Total_confirmed_cases=sum(Total_confirmed_cases)),filter(Corona_Cases.long,Country.Region=="US_state"))

# Put total case and deaths side-by-side (wide)
Corona_Cases<-spread(Corona_Cases.long,key = case_type,value = Total_confirmed_cases)

# Compute mortality rate
Corona_Cases$mortality_rate<-Corona_Cases$death/Corona_Cases$total

#TMP
Corona_Cases<-plyr::rename(Corona_Cases,c("total"="Total_confirmed_cases","death"="Total_confirmed_deaths"))
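
The spread step can be seen in isolation on a toy long frame (assumes `tidyr` is loaded; the numbers are made up):

```r
long <- data.frame(region = c("A", "A"), case_type = c("total", "death"),
                   Total_confirmed_cases = c(1000, 30))
wide <- spread(long, key = case_type, value = Total_confirmed_cases)
wide$death / wide$total  # mortality rate: 0.03
```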

##------------------------------------------
## log10 transform total # cases
##------------------------------------------
Corona_Cases$Total_confirmed_cases.log<-log(Corona_Cases$Total_confirmed_cases,10)
Corona_Cases$Total_confirmed_deaths.log<-log(Corona_Cases$Total_confirmed_deaths,10)
##------------------------------------------
       
##------------------------------------------
## Compute # of days since 100th for US data
##------------------------------------------

# Find the day the 100th case occurred for each Country/Province. NOTE: Non-US countries may have odd provinces. For example, France is summarized at the country level but also had 3 provinces. I've only ensured the U.S. case100 works... so the case100_date for the U.S. is summarized both for the entire country (regardless of state) and on a per-state level.
# TODO: consider a city-level summary as well. This data may be sparse.

Corona_Cases<-merge(Corona_Cases,ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,case100_date=min(Date.numeric)))
Corona_Cases$Days_since_100<-Corona_Cases$Date.numeric-Corona_Cases$case100_date
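
The same pattern on a toy frame (assumes `plyr` is loaded; toy counts chosen so the 100-case threshold is crossed on day 3):

```r
toy <- data.frame(Country.Region = "A",
                  Date.numeric = 1:5,
                  Total_confirmed_cases = c(50, 90, 150, 300, 700))
first100 <- ddply(subset(toy, Total_confirmed_cases > 100),
                  c("Country.Region"), summarise,
                  case100_date = min(Date.numeric))
toy <- merge(toy, first100)
toy$Days_since_100 <- toy$Date.numeric - toy$case100_date
# case100_date is 3, so Days_since_100 runs -2, -1, 0, 1, 2
```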

##------------------------------------------
## Add population and lat/long data (CURRENTLY US ONLY)
##------------------------------------------

kable(filter(metadata,(is.na(Country.Region) | is.na(Population) )) %>% select(c("Country.Region","Province.State","City")) %>% unique(),caption = "Regions for which either population or Country is NA")
Regions for which either population or Country is NA
Country.Region Province.State City
# Drop missing data 
metadata<-filter(metadata,!(is.na(Country.Region) | is.na(Population) ))
# Convert remaining pop to numeric
metadata$Population<-as.numeric(metadata$Population)
## Warning: NAs introduced by coercion
# Add metadata to cases
Corona_Cases<-merge(Corona_Cases,metadata,all.x = T)

##------------------------------------------
## Compute total and death cases relative to population 
##------------------------------------------

Corona_Cases$Total_confirmed_cases.per100<-100*Corona_Cases$Total_confirmed_cases/Corona_Cases$Population
Corona_Cases$Total_confirmed_deaths.per100<-100*Corona_Cases$Total_confirmed_deaths/Corona_Cases$Population


##------------------------------------------
## Filter df for US state-wide stats
##------------------------------------------

Corona_Cases.US_state<-filter(Corona_Cases,Country.Region=="US_state" & Total_confirmed_cases>0 )
Corona_Cases.US_state[!is.na(Corona_Cases.US_state$Province.State) & Corona_Cases.US_state$Province.State=="Pennsylvania" & Corona_Cases.US_state$City=="Delaware","City"]<-"Delaware_PA"

kable(table(select(Corona_Cases.US_state,c("Province.State"))),caption = "Number of longitudinal datapoints (total/death) per state")
Number of longitudinal datapoints (total/death) per state
Var1 Freq
Alabama 6165
Alaska 1270
Arizona 1496
Arkansas 6555
California 5700
Colorado 5567
Connecticut 889
Delaware 375
Diamond Princess 99
District of Columbia 100
Florida 6470
Georgia 14679
Grand Princess 100
Guam 100
Hawaii 512
Idaho 3006
Illinois 8690
Indiana 8493
Iowa 8180
Kansas 6971
Kentucky 9826
Louisiana 6100
Maine 1559
Maryland 2345
Massachusetts 1478
Michigan 7594
Minnesota 7261
Mississippi 7573
Missouri 8839
Montana 2710
Nebraska 5409
Nevada 1221
New Hampshire 1060
New Jersey 2235
New Mexico 2676
New York 5715
North Carolina 9036
North Dakota 3349
Northern Mariana Islands 85
Ohio 8029
Oklahoma 6275
Oregon 3093
Pennsylvania 6276
Puerto Rico 100
Rhode Island 592
South Carolina 4388
South Dakota 4239
Tennessee 8726
Texas 19156
Utah 1357
Vermont 1425
Virgin Islands 100
Virginia 11505
Washington 3892
West Virginia 4342
Wisconsin 6218
Wyoming 1941
Corona_Cases.US_state<-merge(Corona_Cases.US_state,ddply(filter(Corona_Cases.US_state,Total_confirmed_cases>100),c("Province.State"),summarise,case100_date_state=min(Date.numeric)))
Corona_Cases.US_state$Days_since_100_state<-Corona_Cases.US_state$Date.numeric-Corona_Cases.US_state$case100_date_state

ANALYSIS

Q1: What is the trend in cases and mortality across geographical regions?

Plot # of cases vs time
* For each geographical set:
* comparative longitudinal case trend (absolute & log scale)
* comparative longitudinal mortality trend
* death vs total correlation

question dataset x y color facet pch dimensions
comparative_longitudinal_case_trend long time log_cases geography none (case type?) case_type [15, 50, 4] geography x (2 scale?) case type
comparative longitudinal case trend long time cases geography case_type ? [15, 50, 4] geography x (2+ scale) case type
comparative longitudinal mortality trend wide time mortality rate geography none none [15, 50, 4] geography
death vs total correlation wide cases deaths geography none none [15, 50, 4] geography
# total cases vs time
# death cases vs time
# mortality rate vs time
# death vs mortality


  # death vs mortality
  # total & death case vs time (same plot)

#<question> <x> <y> <colored> <facet> <dataset>
## trend in cases/deaths over time, compared across regions <time> <log cases> <geography*> <none> <.wide>
## trend in cases/deaths over time, compared across regions <time> <cases> <geography*> <case_type> <.long>
## trend in mortality rate over time, compared across regions <time> <mortality rate> <geography*> <none>
## how are death/mortality related/correlated? <time> <log cases> <geography*> <none>
## how are death and case load correlated? <cases> <deaths>

# lm for each?? - > apply lm from each region starting from 100th case. m, b associated with each.
    # input: geographical region, logcase vs day (100th case)
    # output: m, b for each geographical region ID



#total/death on the same plot differ by ~2 logs, so when plotting log scale, use pch; when plotting absolute, need free scales
#when plotting death and case on the same plot, melt.

#CoronaCases - > filter sets (3)
  #world - choose countries with sufficient data

N<-ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,n=length(Country.Region))
ggplot(filter(N,n<100),aes(x=n))+
  geom_histogram()+
  default_theme+
  ggtitle("Distribution of number of days with at least 100 confirmed cases for each region")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kable(arrange(N,-n),caption="Sorted number of days with at least 100 confirmed cases")
Sorted number of days with at least 100 confirmed cases
Country.Region n
US_state 71595
China 154
Diamond Princess 135
Korea, South 125
Japan 124
Italy 122
Iran 119
Singapore 116
France 115
Germany 115
Spain 114
US 112
Switzerland 111
United Kingdom 111
Belgium 110
Netherlands 110
Norway 110
Sweden 110
Austria 108
Malaysia 107
Australia 106
Bahrain 106
Denmark 106
Canada 105
Qatar 105
Iceland 104
Brazil 103
Czechia 103
Finland 103
Greece 103
Iraq 103
Israel 103
Portugal 103
Slovenia 103
Egypt 102
Estonia 102
India 102
Ireland 102
Kuwait 102
Philippines 102
Poland 102
Romania 102
Saudi Arabia 102
Indonesia 101
Lebanon 101
Pakistan 101
San Marino 101
Thailand 101
Chile 100
Luxembourg 99
Peru 99
Russia 99
Ecuador 98
Mexico 98
Slovakia 98
South Africa 98
United Arab Emirates 98
Armenia 97
Colombia 97
Croatia 97
Panama 97
Serbia 97
Taiwan* 97
Turkey 97
Argentina 96
Bulgaria 96
Latvia 96
Uruguay 96
Algeria 95
Costa Rica 95
Dominican Republic 95
Hungary 95
Andorra 94
Bosnia and Herzegovina 94
Jordan 94
Lithuania 94
Morocco 94
New Zealand 94
North Macedonia 94
Vietnam 94
Albania 93
Cyprus 93
Malta 93
Moldova 93
Brunei 92
Burkina Faso 92
Sri Lanka 92
Tunisia 92
Ukraine 91
Azerbaijan 90
Ghana 90
Kazakhstan 90
Oman 90
Senegal 90
Venezuela 90
Afghanistan 89
Cote d’Ivoire 89
Cuba 88
Mauritius 88
Uzbekistan 88
Cambodia 87
Cameroon 87
Honduras 87
Nigeria 87
West Bank and Gaza 87
Belarus 86
Georgia 86
Bolivia 85
Kosovo 85
Kyrgyzstan 85
Montenegro 85
Congo (Kinshasa) 84
Kenya 83
Niger 82
Guinea 81
Rwanda 81
Trinidad and Tobago 81
Paraguay 80
Bangladesh 79
Djibouti 77
El Salvador 76
Guatemala 75
Madagascar 74
Mali 73
Congo (Brazzaville) 70
Jamaica 70
Gabon 68
Somalia 68
Tanzania 68
Ethiopia 67
Burma 66
Sudan 65
Liberia 64
Maldives 62
Equatorial Guinea 61
Cabo Verde 59
Sierra Leone 57
Guinea-Bissau 56
Togo 56
Zambia 55
Eswatini 54
Chad 53
Tajikistan 52
Haiti 50
Sao Tome and Principe 50
Benin 48
Nepal 48
Uganda 48
Central African Republic 47
South Sudan 47
Guyana 45
Mozambique 44
Yemen 40
Mongolia 39
Mauritania 36
Nicaragua 36
Malawi 30
Syria 30
Zimbabwe 28
Bahamas 27
Libya 27
Comoros 25
Suriname 17
Angola 14
Eritrea 9
Burundi 8
Monaco 2
# Pick top 15 countries with data
max_colors<-12
# find a way to fix this - China has different provinces, so the plot doesn't look right...
sufficient_data<-arrange(filter(N,!Country.Region %in% c("US_state", "Diamond Princess")),-n)[1:max_colors,]
kable(sufficient_data,caption = paste0("Top ",max_colors," countries with sufficient data"))
Top 12 countries with sufficient data
Country.Region n
China 154
Korea, South 125
Japan 124
Italy 122
Iran 119
Singapore 116
France 115
Germany 115
Spain 114
US 112
Switzerland 111
United Kingdom 111
Corona_Cases.world<-filter(Corona_Cases,Country.Region %in% c(sufficient_data$Country.Region))


  #us 
  #    - by state
Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
# summarize 
#!City %in% c("Unassigned") 
  #    - specific cities
#mortality_rate!=Inf & mortality_rate<=1
head(Corona_Cases)
##   Country.Region Province.State City       Date Date.numeric
## 1    Afghanistan           <NA> <NA> 2020-04-16        18368
## 2    Afghanistan           <NA> <NA> 2020-05-03        18385
## 3    Afghanistan           <NA> <NA> 2020-05-04        18386
## 4    Afghanistan           <NA> <NA> 2020-05-01        18383
## 5    Afghanistan           <NA> <NA> 2020-05-02        18384
## 6    Afghanistan           <NA> <NA> 2020-04-29        18381
##   Total_confirmed_deaths Total_confirmed_cases mortality_rate
## 1                     30                   840     0.03571429
## 2                     85                  2704     0.03143491
## 3                     90                  2894     0.03109883
## 4                     68                  2335     0.02912206
## 5                     72                  2469     0.02916160
## 6                     60                  1939     0.03094379
##   Total_confirmed_cases.log Total_confirmed_deaths.log case100_date
## 1                  2.924279                   1.477121        18348
## 2                  3.432007                   1.929419        18348
## 3                  3.461499                   1.954243        18348
## 4                  3.368287                   1.832509        18348
## 5                  3.392521                   1.857332        18348
## 6                  3.287578                   1.778151        18348
##   Days_since_100 Lat Long Population Total_confirmed_cases.per100
## 1             20  NA   NA         NA                           NA
## 2             37  NA   NA         NA                           NA
## 3             38  NA   NA         NA                           NA
## 4             35  NA   NA         NA                           NA
## 5             36  NA   NA         NA                           NA
## 6             33  NA   NA         NA                           NA
##   Total_confirmed_deaths.per100
## 1                            NA
## 2                            NA
## 3                            NA
## 4                            NA
## 5                            NA
## 6                            NA
Corona_Cases[!is.na(Corona_Cases$Province.State) & Corona_Cases$Province.State=="Pennsylvania" & Corona_Cases$City=="Delaware","City"]<-"Delaware_PA"


Corona_Cases.UScity<-filter(Corona_Cases,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") & City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA"))


measure_vars_long<-c("Total_confirmed_cases.log","Total_confirmed_cases","Total_confirmed_deaths","Total_confirmed_deaths.log")
melt_arg_list<-list(variable.name = "case_type",value.name = "cases",measure.vars = c("Total_confirmed_cases","Total_confirmed_deaths"))
melt_arg_list$data=NULL


melt_arg_list$data=select(Corona_Cases.world,-ends_with(match = "log"))
Corona_Cases.world.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.UScity,-ends_with(match = "log"))
Corona_Cases.UScity.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.US_state,-ends_with(match = "log"))
Corona_Cases.US_state.long<-do.call(melt,melt_arg_list)

Corona_Cases.world.long$cases.log<-log(Corona_Cases.world.long$cases,10)
Corona_Cases.US_state.long$cases.log<-log(Corona_Cases.US_state.long$cases,10)
Corona_Cases.UScity.long$cases.log<-log(Corona_Cases.UScity.long$cases,10)
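
The shared-argument pattern above (one `melt` argument list reused across datasets via `do.call`) can be sketched in isolation (assumes `reshape2` is loaded; frame and column names are made up):

```r
args <- list(variable.name = "case_type", value.name = "cases",
             measure.vars = c("total", "death"))
df <- data.frame(region = c("A", "B"), total = c(10, 20), death = c(1, 2))
args$data <- df
long <- do.call(melt, args)
# 4 rows: one per region x case_type combination
```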


# what is the current death and total case load for US? For world? For states?
#-absolute
#-log

# what is mortality rate (US, world)
#-absolute

#how is death and case correlated? (US, world)
#-absolute
#Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
#Corona_Cases.US.case100<-filter(Corona_Cases.US, Days_since_100>=0)
# linear model parameters
#(model_fit<-lm(formula = Total_confirmed_cases.log~Days_since_100,data= Corona_Cases.US.case100 ))

#(slope<-model_fit$coefficients[2])
#(intercept<-model_fit$coefficients[1])

# Correlation coefficient
#cor(x = Corona_Cases.US.case100$Days_since_100,y = Corona_Cases.US.case100$Total_confirmed_cases.log)

##------------------------------------------
## Plot World Data
##------------------------------------------
# Timestamp for world
timestamp_plot.world<-paste("Most recent date for which data available:",max(Corona_Cases.world$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")


# Base template for plots
baseplot.world<-ggplot(data=NULL,aes(x=Days_since_100,col=Country.Region))+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))


##/////////////////////////
### Plot Longitudinal cases

(Corona_Cases.world.long.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world)
    )

(Corona_Cases.world.loglong.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases.log))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases.log))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world))

##/////////////////////////
### Plot Longitudinal mortality rate

(Corona_Cases.world.mortality.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world,aes(y=mortality_rate))+
    geom_line(data=Corona_Cases.world,aes(y=mortality_rate))+
    ylim(c(0,0.3))+
    ggtitle(timestamp_plot.world))
## Warning: Removed 100 rows containing missing values (geom_point).
## Warning: Removed 100 row(s) containing missing values (geom_path).

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.world.casecor.plot<-ggplot(Corona_Cases.world,aes(x=Total_confirmed_cases,y=Total_confirmed_deaths,col=Country.Region))+
  geom_point()+
  geom_line()+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
    ggtitle(timestamp_plot.world))

### Write plots

write_plot(Corona_Cases.world.long.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.long.plot.png"
write_plot(Corona_Cases.world.loglong.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.loglong.plot.png"
write_plot(Corona_Cases.world.mortality.plot,wd = results_dir)
## Warning: Removed 100 rows containing missing values (geom_point).

## Warning: Removed 100 row(s) containing missing values (geom_path).
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.mortality.plot.png"
write_plot(Corona_Cases.world.casecor.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.casecor.plot.png"
##------------------------------------------
## Plot US State Data
##-----------------------------------------

baseplot.US<-ggplot(data=NULL,aes(x=Days_since_100_state,col=case_type))+
  default_theme+
  facet_wrap(~Province.State)+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))

Corona_Cases.US_state.long.plot<-baseplot.US+geom_point(data=Corona_Cases.US_state.long,aes(y=cases.log))
##------------------------------------------
## Plot US City Data
##-----------------------------------------

Corona_Cases.US.plotdata<-filter(Corona_Cases.US_state,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") &
                                   City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA") &
                                   Total_confirmed_cases>0) 
timestamp_plot<-paste("Most recent date for which data available:",max(Corona_Cases.US.plotdata$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")

city_colors<-c("Bucks"='#beaed4',"Baltimore City"='#386cb0', "New York"='#7fc97f',"Burlington"='#fdc086',"Cape May"="#e78ac3","Delaware_PA"="#b15928")

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.city.loglong.plot<-ggplot(melt(Corona_Cases.US.plotdata,measure.vars = c("Total_confirmed_cases.log","Total_confirmed_deaths.log"),variable.name = "case_type",value.name = "cases"),aes(x=Date,y=cases,col=City,pch=case_type))+
  geom_point(size=4)+
    geom_line()+
  default_theme+
  #facet_wrap(~case_type)+
    ggtitle(paste("Log10 total and death cases over time,",timestamp_plot))+
theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
    scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.long.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State,scales = "free_y")+
  ggtitle(paste("MD, PA, NJ total cases over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.mortality.plot<-ggplot(Corona_Cases.US.plotdata,aes(x=Date,y=mortality_rate,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Mortality rate (deaths/total) over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.casecor.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(y=Total_confirmed_deaths,x=Total_confirmed_cases,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  scale_color_manual(values = city_colors))

(Corona_Cases.city.long.normalized.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases.per100,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State)+
  ggtitle(paste("MD, PA, NJ total cases over time per 100 people,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)  +
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

write_plot(Corona_Cases.city.long.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.plot.png"
write_plot(Corona_Cases.city.loglong.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.loglong.plot.png"
write_plot(Corona_Cases.city.mortality.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.mortality.plot.png"
write_plot(Corona_Cases.city.casecor.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.casecor.plot.png"
write_plot(Corona_Cases.city.long.normalized.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.normalized.plot.png"

Q1b: What is the model?

Fit the cases to a linear model:

1. Find the time at which cases vs. date becomes linear in each plot
2. Fit a linear model for each city
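The two steps above can be sketched in miniature. find_linear_index is defined elsewhere in this project; the version below is one plausible implementation (flag the first point where the second difference of the case curve is near zero), and one_state and the tol threshold are hypothetical placeholders:

```r
# Step 1 (sketch): find the date at which growth looks linear, i.e. the
# second difference of cases vs. date is approximately zero.
find_linear_index_sketch <- function(df, tol = 50) {
  d2 <- diff(df$Total_confirmed_cases_perstate, differences = 2)
  idx <- which(abs(d2) < tol)[1]          # first near-zero second difference
  as.numeric(df$Date[ifelse(is.na(idx), 1, idx)])
}

# Step 2 (sketch): fit a linear model from that date onward, for one state/city
start_date <- find_linear_index_sketch(one_state)
lm(Total_confirmed_cases_perstate ~ Date,
   data = subset(one_state, as.numeric(Date) >= start_date))
```

The project's actual helper may differ in how it handles the early flat phase of the curve, where the second difference is also near zero.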

# What is the predict # of cases for the next few days?
# How is the model performing historically?

Corona_Cases.US_state.summary<-ddply(Corona_Cases.US_state,
                                     c("Province.State","Date"),
                                     summarise,
                                     Total_confirmed_cases_perstate=sum(Total_confirmed_cases)) %>% 
    filter(Total_confirmed_cases_perstate>100)

# Compute the states with the most cases (for coloring and for linear model)
top_states_totals<-head(ddply(Corona_Cases.US_state.summary,c("Province.State"),summarise, Total_confirmed_cases_perstate.max=max(Total_confirmed_cases_perstate)) %>% arrange(-Total_confirmed_cases_perstate.max),n=max_colors)

kable(top_states_totals,caption = "Top 12 states, total count")
top_states<-top_states_totals$Province.State

# Manually swap New York out of the top states and Maryland in
top_states_modified<-c(top_states[top_states !="New York"],"Maryland")

# Plot with all states:
(Corona_Cases.US_state.summary.plot<-ggplot(Corona_Cases.US_state.summary,aes(x=Date,y=Total_confirmed_cases_perstate))+
  geom_point()+
  geom_point(data=filter(Corona_Cases.US_state.summary,Province.State %in% top_states),aes(col=Province.State))+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  default_theme+
  theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
  ggtitle("Total confirmed cases per state, top 12 colored")+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Fit linear model to time vs total cases
##-----------------------------------------

# First, find the date at which each state's cases vs time becomes linear (2nd derivative is about 0)
li<-ddply(Corona_Cases.US_state.summary,c("Province.State"),find_linear_index)

# Compute linear model for each state starting at the point at which data becomes linear
for(i in 1:nrow(li)){
  Province.State.i<-li[i,"Province.State"]
  date.i<-li[i,"V1"]
  data.i<-filter(Corona_Cases.US_state.summary,Province.State==Province.State.i & as.numeric(Date) >= date.i)
  model_results<-lm(data.i,formula = Total_confirmed_cases_perstate~Date)
  slope<-model_results$coefficients[2]
  intercept<-model_results$coefficients[1]
  li[li$Province.State==Province.State.i,"m"]<-slope
  li[li$Province.State==Province.State.i,"b"]<-intercept
  }

# Compute top state case load with fitted model

(Corona_Cases.US_state.lm.plot<-ggplot(filter(Corona_Cases.US_state.summary,Province.State %in% top_states_modified ))+
    geom_abline(data=filter(li,Province.State %in% top_states_modified),
                aes(slope = m,intercept = b,col=Province.State),lty=2)+
    geom_point(aes(x=Date,y=Total_confirmed_cases_perstate,col=Province.State))+
    scale_color_brewer(type = "qualitative",palette = "Paired")+
    default_theme+
    theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
    ggtitle("Total confirmed cases per state, top 12 colored")+
    scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Predict the number of total cases over the next week
##-----------------------------------------

predicted_days<-c(0,1,2,3,7)+as.numeric(as.Date("2020-04-20"))

# Empty seed frame (avoids an all-NA first row when rbinding below)
predicted_days_df<-data.frame(Province.State=character(0),days=numeric(0),Total_confirmed_cases_perstate=numeric(0))

# Use model parameters to estimate case loads
for(state.i in top_states_modified){
  predicted_days_df<-rbind(predicted_days_df,
                           data.frame(Province.State=state.i,
                                      prediction_model(m = li[li$Province.State==state.i,"m"],
                                                       b =li[li$Province.State==state.i,"b"] ,
                                                       days =predicted_days )))
  }

predicted_days_df$Date<-as.Date(predicted_days_df$days,origin="1970-01-01")

kable(predicted_days_df,caption = "Predicted total cases over the next week for selected states")

##------------------------------------------
## Write plots
##-----------------------------------------

write_plot(Corona_Cases.US_state.summary.plot,wd = results_dir)
write_plot(Corona_Cases.US_state.lm.plot,wd = results_dir)

##------------------------------------------
## Write tables
##-----------------------------------------

write.csv(predicted_days_df,file = paste0(results_dir,"predicted_total_cases_days.csv"),quote = F,row.names = F)

Q2: What is the predicted number of cases?

What is the predicted trajectory of COVID-19 based on the model thus far? Additional questions:

Why did it take until day 40 for a log-linear trend to start? How long until x number of cases? When will the plateau happen? Are any effects of social distancing (and its delays) noticeable?
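Before the code below, it helps to pin down what prediction_model does. It is defined elsewhere in this project; the version here is a plausible sketch, assuming the US fit is on the log10 scale (Total_confirmed_cases.log ~ Days_since_100), so predictions are back-transformed with 10^(m*days + b). The example slope/intercept values are made up for illustration:

```r
# Hedged sketch of prediction_model: extrapolate a log10-linear fit.
prediction_model_sketch <- function(m, b, days) {
  cases.log <- m * days + b
  data.frame(Days_since_100 = days,
             Total_confirmed_cases = 10^cases.log,
             Total_confirmed_cases.log = cases.log)
}

# e.g. project 1, 3 and 7 days past day 40 with hypothetical m = 0.05, b = 2
prediction_model_sketch(m = 0.05, b = 2, days = 40 + c(1, 3, 7))
```

Note that the per-state models fit raw counts vs. Date, so for those the extrapolation is simply m*days + b with no back-transform.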

##------------------------------------------
## Prediction and Prediction Accuracy
##------------------------------------------


today_num<-max(Corona_Cases.US$Days_since_100)
predicted_days<-today_num+c(1,2,3,7)

#mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
#today:
Corona_Cases.US[Corona_Cases.US$Days_since_100==(today_num-1),]
Corona_Cases.US[Corona_Cases.US$Days_since_100==today_num,]
Corona_Cases.US$type<-"Historical"


#prediction_values<-prediction_model(m=slope,b=intercept,days = predicted_days)$Total_confirmed_cases

historical_model<-data.frame(date=today_num,m=slope,b=intercept)
tmp<-data.frame(state=rep(c("A","B"),each=3),x=c(1,2,3,4,5,6))
tmp$y<-c(tmp[1:3,"x"]+5,tmp[4:6,"x"]*5+1)
ddply(tmp,c("state"))
lm(data =tmp,formula = y~x )

# Fit a per-group linear model and return slope/intercept per group
train_lm<-function(input_data,subset_column,formula_input){
  case_models <- dlply(input_data, subset_column, lm, formula = formula_input)
  case_models.parameters <- ldply(case_models, coef)
  # plyr::rename expects c(old_name = new_name); the slope column is named after the predictor
  predictor <- all.vars(formula_input)[2]
  case_models.parameters<-plyr::rename(case_models.parameters,setNames(c("b","m"),c("(Intercept)",predictor)))
  return(case_models.parameters)
}

train_lm(tmp,"state",y~x)

 
# model for previous y days
#historical_model_predictions<-data.frame(day_x=NULL,Days_since_100=NULL,Total_confirmed_cases=NULL,Total_confirmed_cases.log=NULL)
# for(i in c(1,2,3,4,5,6,7,8,9,10)){
#   #i<-1
# day_x<-today_num-i # 1, 2, 3, 4
# day_x_nextweek<-day_x+c(1,2,3)
# model_fit_x<-lm(data = filter(Corona_Cases.US.case100,Days_since_100 < day_x),formula = Total_confirmed_cases.log~Days_since_100)
# prediction_day_x_nextweek<-prediction_model(m = model_fit_x$coefficients[2],b = model_fit_x$coefficients[1],days = day_x_nextweek)
# prediction_day_x_nextweek$type<-"Predicted"
# actual_day_x_nextweek<-filter(Corona_Cases.US,Days_since_100 %in% day_x_nextweek) %>% select(c(Days_since_100,Total_confirmed_cases,Total_confirmed_cases.log))
# actual_day_x_nextweek$type<-"Historical"
# historical_model_predictions.i<-data.frame(day_x=day_x,rbind(actual_day_x_nextweek,prediction_day_x_nextweek))
# historical_model_predictions<-rbind(historical_model_predictions.i,historical_model_predictions)
# }

#historical_model_predictions.withHx<-rbind.fill(historical_model_predictions,data.frame(Corona_Cases.US,type="Historical"))
#historical_model_predictions.withHx$Total_confirmed_cases.log2<-log(historical_model_predictions.withHx$Total_confirmed_cases,2)

(historical_model_predictions.plot<-ggplot(historical_model_predictions.withHx,aes(x=Days_since_100,y=Total_confirmed_cases.log,col=type))+
    geom_point(size=3)+
    default_theme+
    theme(legend.position = "bottom")+ 
      #geom_abline(slope = slope,intercept =intercept,lty=2)+
    #facet_wrap(~case_type,ncol=1)+
    scale_color_manual(values = c("Historical"="#377eb8","Predicted"="#e41a1c")))
write_plot(historical_model_predictions.plot,wd=results_dir)

Q3: What is the effect of social distancing and decreased mobility on case load?

Load data from Google, which computes the % change in user mobility relative to baseline for:

* Recreation
* Workplace
* Residence
* Park
* Grocery

Data from https://www.google.com/covid19/mobility/

# See pre-processing section for script on gathering mobility data

# UNDER DEVELOPMENT

mobility<-read.csv("/Users/stevensmith/Projects/MIT_COVID19/mobility.csv",header = T,stringsAsFactors = F)
#mobility$Retail_Recreation<-as.numeric(sub(mobility$Retail_Recreation,pattern = "%",replacement = ""))
#mobility$Workplace<-as.numeric(sub(mobility$Workplace,pattern = "%",replacement = ""))
#mobility$Residential<-as.numeric(sub(mobility$Residential,pattern = "%",replacement = ""))

##------------------------------------------
## Show relationship between mobility and caseload
##------------------------------------------
mobility$County<-gsub(mobility$County,pattern = " County",replacement = "")
Corona_Cases.US_state.mobility<-merge(Corona_Cases.US_state,plyr::rename(mobility,c("State"="Province.State","County"="City")))

#Corona_Cases.US_state.tmp<-merge(metadata,Corona_Cases.US_state.tmp)
# Needs to happen upsteam, see todos
#Corona_Cases.US_state.tmp$Total_confirmed_cases.perperson<-Corona_Cases.US_state.tmp$Total_confirmed_cases/as.numeric(Corona_Cases.US_state.tmp$Population)
mobility_measures<-c("Retail_Recreation","Grocery_Pharmacy","Parks","Transit","Workplace","Residential")

plot_data<-filter(Corona_Cases.US_state.mobility, Date.numeric==max(Corona_Cases.US_state$Date.numeric) ) %>% melt(measure.vars=mobility_measures) 
plot_data$value<-as.numeric(gsub(plot_data$value,pattern = "%",replacement = ""))
plot_data<-filter(plot_data,!is.na(value))

(mobility.plot<-ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_grid(Province.State~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab("Confirmed cases per 100 people (today)")+
  default_theme+
  ggtitle("Mobility change vs cases"))

(mobility.global.plot<-ggplot(plot_data,aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_wrap(~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab("Confirmed cases (today) per 100 people")+
  default_theme+
  ggtitle("Mobility change vs cases"))

plot_data.permobility_summary<-ddply(plot_data,c("Province.State","variable"),summarise,cor=cor(y =Total_confirmed_cases.per100,x=value),median_change=median(x=value)) %>% arrange(-abs(cor))

kable(plot_data.permobility_summary,caption = "Ranked per-state mobility correlation with total confirmed cases")
Ranked per-state mobility correlation with total confirmed cases
Province.State variable cor median_change
Alaska Transit 1.0000000 -63.0
Delaware Retail_Recreation 1.0000000 -39.5
Delaware Grocery_Pharmacy 1.0000000 -17.5
Delaware Parks -1.0000000 20.5
Delaware Transit 1.0000000 -37.0
Delaware Workplace 1.0000000 -37.0
Delaware Residential -1.0000000 14.0
Hawaii Parks 0.9998011 -72.0
Hawaii Transit 0.9947783 -89.0
New Hampshire Parks 0.9499544 -20.0
Vermont Parks 0.9292450 -35.5
Maine Transit -0.9171857 -50.0
Connecticut Grocery_Pharmacy -0.8822271 -6.0
Utah Residential -0.8592659 12.0
Utah Transit -0.8517295 -18.0
Hawaii Grocery_Pharmacy 0.8505237 -34.0
South Dakota Parks 0.7941168 -26.0
Hawaii Retail_Recreation 0.7747086 -56.0
Rhode Island Workplace -0.7691260 -39.5
Arizona Grocery_Pharmacy -0.7684763 -15.0
Arizona Residential 0.7583394 13.0
Connecticut Transit -0.7520864 -50.0
Massachusetts Workplace -0.7508105 -39.0
Wyoming Parks -0.7387676 -4.0
Alaska Residential 0.7355127 13.0
Alaska Grocery_Pharmacy -0.7170269 -7.0
Nevada Transit -0.7029312 -20.0
Alaska Workplace -0.6631559 -33.0
Vermont Grocery_Pharmacy -0.6586262 -25.0
New York Workplace -0.6542121 -34.5
North Dakota Parks 0.6391841 -34.0
Idaho Residential -0.6364649 11.0
Rhode Island Retail_Recreation -0.6280705 -45.0
Utah Parks -0.5986697 17.0
Rhode Island Residential -0.5981837 18.5
New Jersey Parks -0.5979205 -6.0
New York Retail_Recreation -0.5908120 -46.0
Maine Workplace -0.5823998 -30.0
Nebraska Workplace 0.5799585 -32.0
Arkansas Parks 0.5595636 -12.0
Utah Workplace -0.5334406 -37.0
New York Parks 0.5298450 20.0
New Jersey Workplace -0.5223803 -44.0
Connecticut Retail_Recreation -0.5160696 -45.0
Connecticut Residential 0.5076989 14.0
Arizona Retail_Recreation -0.5054850 -42.5
Maine Parks 0.4999679 -31.0
Massachusetts Retail_Recreation -0.4943900 -44.0
New Mexico Grocery_Pharmacy -0.4849802 -11.0
New Jersey Grocery_Pharmacy -0.4847149 2.5
Arizona Transit 0.4709366 -38.0
Connecticut Workplace -0.4703456 -39.0
Nebraska Residential -0.4660959 14.0
New Mexico Residential 0.4589472 13.5
Maryland Workplace -0.4574440 -35.0
Montana Parks -0.4524120 -58.0
Rhode Island Parks 0.4463111 52.0
Iowa Parks -0.4406799 28.5
Missouri Residential -0.4396598 13.0
North Dakota Retail_Recreation -0.4268445 -42.0
West Virginia Parks 0.4245907 -33.0
Illinois Transit -0.4195225 -31.0
Pennsylvania Workplace -0.4184960 -36.0
Kentucky Parks -0.4152071 28.5
Maryland Grocery_Pharmacy -0.4087394 -10.0
New Jersey Retail_Recreation -0.4075410 -62.5
Massachusetts Grocery_Pharmacy -0.4060062 -7.0
New Jersey Transit -0.4049316 -50.5
New Mexico Parks 0.3962897 -31.5
Montana Residential 0.3959127 14.0
Pennsylvania Retail_Recreation -0.3940290 -45.0
Vermont Residential 0.3933219 11.5
New Hampshire Residential -0.3907156 14.0
Oregon Parks -0.3865357 16.5
South Carolina Parks -0.3844253 -23.0
Alabama Workplace -0.3818537 -29.0
New Mexico Retail_Recreation -0.3772964 -42.5
South Carolina Workplace 0.3698404 -30.0
New York Transit -0.3656961 -48.0
Michigan Parks 0.3631126 28.5
Alabama Grocery_Pharmacy -0.3554912 -2.0
Montana Transit 0.3522327 -41.0
Nebraska Grocery_Pharmacy 0.3414093 -0.5
Wisconsin Transit -0.3310962 -23.5
North Dakota Workplace 0.3267674 -40.0
Maryland Retail_Recreation -0.3241366 -39.0
Virginia Transit -0.3192485 -32.5
Arkansas Retail_Recreation -0.3170989 -30.0
Maine Retail_Recreation -0.3150214 -42.0
Kansas Parks 0.3148976 72.0
Alaska Retail_Recreation 0.3136481 -39.0
Florida Residential 0.3101084 14.0
Colorado Residential 0.3080361 14.0
Idaho Workplace -0.3050337 -29.0
Nevada Residential 0.3020367 17.0
California Parks -0.3008683 -38.5
California Residential 0.2981679 14.0
Illinois Workplace -0.2948692 -30.5
Idaho Grocery_Pharmacy -0.2910004 -5.5
California Transit -0.2830080 -42.0
Minnesota Transit -0.2829572 -28.5
Pennsylvania Parks 0.2793250 12.0
Idaho Transit -0.2758650 -30.0
Pennsylvania Grocery_Pharmacy -0.2667252 -6.0
Missouri Workplace 0.2662445 -29.0
North Carolina Workplace 0.2655072 -31.0
Vermont Retail_Recreation 0.2639835 -57.0
Maryland Residential 0.2637933 15.0
Rhode Island Grocery_Pharmacy 0.2621455 -7.5
West Virginia Grocery_Pharmacy -0.2619890 -6.0
Vermont Workplace -0.2610926 -43.0
Kansas Workplace 0.2605419 -32.0
North Carolina Grocery_Pharmacy 0.2600387 0.0
Alabama Transit -0.2527298 -36.5
Illinois Parks 0.2524148 26.5
Wyoming Grocery_Pharmacy -0.2516940 -10.0
Oregon Grocery_Pharmacy -0.2490360 -7.0
Tennessee Workplace -0.2488740 -31.0
Nevada Workplace -0.2481077 -40.0
Georgia Grocery_Pharmacy -0.2466673 -10.0
Rhode Island Transit -0.2460045 -56.0
Wisconsin Parks 0.2381459 51.5
Minnesota Parks 0.2375970 -9.0
Tennessee Parks -0.2357483 10.5
Missouri Parks 0.2353568 0.0
Tennessee Residential 0.2348968 11.5
Washington Grocery_Pharmacy 0.2347923 -7.0
Texas Workplace 0.2326515 -32.0
Hawaii Workplace -0.2291808 -46.0
Florida Parks -0.2287866 -43.0
Indiana Parks -0.2277577 29.0
New York Grocery_Pharmacy -0.2277044 8.0
South Dakota Workplace 0.2266282 -35.0
Georgia Workplace -0.2251551 -33.5
Arizona Parks -0.2160926 -44.5
Nevada Retail_Recreation -0.2140783 -43.0
Georgia Retail_Recreation -0.2111197 -41.0
West Virginia Workplace 0.2093669 -33.0
Mississippi Residential 0.2091520 13.0
North Dakota Grocery_Pharmacy -0.2083182 -8.0
Alabama Parks 0.2065361 -1.0
Oregon Residential -0.2050338 10.5
Nebraska Transit -0.2048701 -9.0
Connecticut Parks 0.2013098 43.0
Wyoming Workplace -0.1994895 -31.0
Illinois Residential 0.1987588 14.0
North Carolina Transit 0.1972978 -32.0
Texas Residential -0.1945810 15.0
Kansas Grocery_Pharmacy -0.1941371 -14.5
Mississippi Grocery_Pharmacy -0.1924055 -8.0
South Dakota Residential 0.1889408 15.0
Colorado Parks -0.1879083 2.0
Wisconsin Residential -0.1829324 14.0
Oklahoma Parks 0.1790439 -18.5
North Carolina Residential 0.1788698 13.0
Virginia Residential 0.1782995 14.0
New Hampshire Retail_Recreation -0.1723050 -41.0
Pennsylvania Transit -0.1708340 -42.0
Ohio Transit 0.1671835 -28.0
Idaho Parks 0.1643543 -22.0
Massachusetts Parks 0.1635366 39.0
Indiana Residential 0.1625622 12.0
Kentucky Grocery_Pharmacy 0.1619293 4.0
Utah Retail_Recreation -0.1617216 -40.0
Kentucky Transit -0.1599864 -31.0
Kentucky Residential 0.1553932 12.0
New Mexico Transit 0.1469499 -38.5
Idaho Retail_Recreation -0.1462595 -39.5
Montana Retail_Recreation 0.1448247 -51.0
New Jersey Residential 0.1447885 18.0
North Dakota Residential -0.1446750 17.0
Texas Parks 0.1434821 -42.0
Minnesota Retail_Recreation 0.1351633 -40.5
West Virginia Residential -0.1332531 11.0
Mississippi Retail_Recreation -0.1319232 -40.0
Oregon Transit 0.1309742 -27.5
Virginia Grocery_Pharmacy -0.1304581 -8.0
Iowa Transit 0.1291881 -24.0
Kansas Transit -0.1267579 -23.0
North Carolina Retail_Recreation 0.1252094 -34.0
Texas Grocery_Pharmacy 0.1249013 -14.0
Wisconsin Grocery_Pharmacy 0.1230143 -1.0
Utah Grocery_Pharmacy 0.1228184 -4.0
North Dakota Transit 0.1226265 -48.0
Indiana Retail_Recreation 0.1204743 -38.0
Michigan Workplace -0.1198068 -40.0
Mississippi Parks -0.1181610 -25.0
Wyoming Transit -0.1162587 -17.0
Oklahoma Residential -0.1120022 15.0
California Grocery_Pharmacy -0.1110331 -11.5
South Carolina Transit -0.1088430 -45.0
Nebraska Retail_Recreation 0.1071852 -36.0
Wisconsin Workplace -0.1068531 -31.0
Iowa Workplace 0.1052061 -30.0
Oklahoma Grocery_Pharmacy -0.1044421 -0.5
Montana Workplace -0.1041387 -40.0
Virginia Parks 0.1037106 6.0
Alabama Retail_Recreation 0.1036184 -39.0
Oklahoma Workplace 0.1033831 -31.0
New Hampshire Grocery_Pharmacy -0.1033726 -6.0
Oregon Retail_Recreation 0.1032504 -40.5
Maryland Transit -0.1029669 -39.0
Hawaii Residential -0.1020592 19.0
Missouri Transit -0.1006983 -24.5
Massachusetts Transit -0.1003359 -45.0
Arkansas Workplace -0.0996413 -26.0
Massachusetts Residential 0.0983259 15.0
California Retail_Recreation -0.0982821 -44.0
Minnesota Grocery_Pharmacy 0.0980995 -6.0
Indiana Workplace 0.0943839 -34.0
Iowa Grocery_Pharmacy -0.0941688 4.0
Florida Transit -0.0922672 -49.0
Michigan Transit 0.0921637 -46.0
New York Residential 0.0895611 17.5
Florida Retail_Recreation 0.0889200 -43.0
Wyoming Residential 0.0887040 12.5
West Virginia Transit -0.0876053 -45.0
Michigan Retail_Recreation -0.0811897 -53.0
Virginia Retail_Recreation -0.0803017 -35.0
Ohio Residential 0.0786617 14.0
Alabama Residential -0.0784466 11.0
Minnesota Residential -0.0750115 17.0
Michigan Residential 0.0749640 15.0
Georgia Transit -0.0739371 -35.0
Pennsylvania Residential 0.0725426 15.0
Ohio Grocery_Pharmacy 0.0696852 0.0
North Carolina Parks -0.0692924 7.0
Kentucky Retail_Recreation 0.0689948 -29.0
Washington Transit -0.0680645 -33.5
Washington Retail_Recreation 0.0649389 -42.0
Texas Retail_Recreation 0.0641052 -40.0
West Virginia Retail_Recreation -0.0636773 -38.5
Virginia Workplace -0.0622992 -32.0
South Dakota Transit -0.0618311 -40.0
South Dakota Retail_Recreation -0.0606755 -39.0
Georgia Parks 0.0600104 -6.0
Vermont Transit -0.0586308 -63.0
Texas Transit 0.0548426 -41.5
Iowa Retail_Recreation 0.0543932 -38.0
Florida Workplace 0.0534481 -33.0
Tennessee Transit 0.0529297 -32.0
Nevada Parks 0.0502160 -12.5
Nebraska Parks 0.0465749 55.5
Arkansas Grocery_Pharmacy -0.0463379 3.0
Mississippi Transit -0.0462755 -38.5
Arkansas Transit -0.0461642 -27.0
Wyoming Retail_Recreation 0.0458543 -39.0
Indiana Transit 0.0455155 -29.0
New Hampshire Workplace 0.0433376 -37.0
South Carolina Residential -0.0427481 12.0
Ohio Retail_Recreation 0.0422397 -36.0
Arizona Workplace -0.0417396 -35.0
Kentucky Workplace -0.0406420 -36.5
Oregon Workplace -0.0395921 -31.0
Ohio Workplace -0.0395084 -35.0
Georgia Residential -0.0394476 13.0
New Hampshire Transit 0.0394087 -57.0
New Mexico Workplace 0.0388753 -34.0
Colorado Retail_Recreation -0.0384142 -44.0
Arkansas Residential -0.0366960 12.0
Maine Residential -0.0360935 11.0
Illinois Grocery_Pharmacy -0.0337659 2.0
Colorado Transit 0.0331212 -36.0
Minnesota Workplace -0.0323154 -33.0
Colorado Grocery_Pharmacy -0.0320364 -17.0
California Workplace -0.0312230 -36.0
Ohio Parks -0.0309801 67.5
Illinois Retail_Recreation 0.0308512 -40.0
Tennessee Retail_Recreation -0.0300615 -30.0
Missouri Grocery_Pharmacy -0.0296419 2.0
Florida Grocery_Pharmacy 0.0269600 -14.0
Missouri Retail_Recreation -0.0230287 -36.0
Oklahoma Retail_Recreation -0.0195166 -31.0
Maine Grocery_Pharmacy -0.0191347 -13.0
Mississippi Workplace 0.0186317 -33.0
Oklahoma Transit 0.0179045 -26.0
Kansas Residential -0.0150248 13.0
Montana Grocery_Pharmacy -0.0149647 -16.0
Wisconsin Retail_Recreation 0.0143330 -44.0
South Dakota Grocery_Pharmacy -0.0138291 -9.0
Maryland Parks 0.0136926 27.0
Washington Residential -0.0136361 13.0
Washington Workplace 0.0118784 -38.0
Washington Parks 0.0110646 -3.5
Tennessee Grocery_Pharmacy 0.0090792 6.0
Nevada Grocery_Pharmacy -0.0069692 -12.5
Colorado Workplace 0.0066991 -39.0
Michigan Grocery_Pharmacy 0.0059861 -11.0
Indiana Grocery_Pharmacy -0.0058101 -5.5
Iowa Residential 0.0049871 13.0
South Carolina Grocery_Pharmacy -0.0040748 1.0
South Carolina Retail_Recreation 0.0039830 -35.0
Kansas Retail_Recreation -0.0001007 -39.0
Alaska Parks NA 29.0
District of Columbia Retail_Recreation NA -69.0
District of Columbia Grocery_Pharmacy NA -28.0
District of Columbia Parks NA -65.0
District of Columbia Transit NA -69.0
District of Columbia Workplace NA -48.0
District of Columbia Residential NA 17.0
# sanity check
ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(x=Total_confirmed_cases.per100,fill=variable))+geom_histogram()+
  facet_grid(~Province.State)+
    default_theme+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

write_plot(mobility.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.plot.png"
write_plot(mobility.global.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.global.plot.png"
(plot_data.permobility_summary.plot<-ggplot(plot_data.permobility_summary,aes(x=variable,y=median_change))+
  geom_jitter(size=2,width=.2)+
  #geom_jitter(data=plot_data.permobility_summary %>% arrange(-abs(median_change)) %>% head(n=15),aes(col=Province.State),size=2,width=.2)+
  default_theme+
  ggtitle("Per-State Median Change in Mobility")+
  xlab("Mobility Measure")+
  ylab("Median Change from Baseline"))

write_plot(plot_data.permobility_summary.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/plot_data.permobility_summary.plot.png"

DELIVERABLE MANIFEST

The following links point to committed documents pushed to GitHub. These are provided as a convenience, but note this is a manual process: the generation of reports, plots, and tables is not coupled to the execution of this markdown.

Report

This report, html & pdf

Plots

github_root<-"https://github.com/sbs87/coronavirus/blob/master/"

plot_handle<-c("Corona_Cases.world.long.plot",
               "Corona_Cases.world.loglong.plot",
               "Corona_Cases.world.mortality.plot",
               "Corona_Cases.world.casecor.plot",
               "Corona_Cases.city.long.plot",
               "Corona_Cases.city.loglong.plot",
               "Corona_Cases.city.mortality.plot",
               "Corona_Cases.city.casecor.plot",
               "Corona_Cases.city.long.normalized.plot",
               "Corona_Cases.US_state.lm.plot",
               "Corona_Cases.US_state.summary.plot")


deliverable_manifest<-data.frame(
  name=c("World total & death cases, longitudinal",
         "World log total & death cases, longitudinal",
         "World mortality",
         "World total & death cases, correlation",
         "City total & death cases, longitudinal",
         "City log total & death cases, longitudinal",
         "City mortality",
         "City total & death cases, correlation",
         "City population normalized total & death cases, longitudinal",
         "State total cases (select) with linear model, longitudinal",
         "State total cases, longitudinal"),
  plot_handle=plot_handle,
  link=paste0(github_root,"results/",plot_handle,".png")
)


(tmp<-data.frame(row_out=apply(deliverable_manifest,MARGIN = 1,FUN = function(x) paste(x[1],x[2],x[3],sep=" | "))))
##                                                                                                                                                                                                        row_out
## 1                                           World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
## 2                                 World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
## 3                                                         World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
## 4                                      World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
## 5                                              City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
## 6                                    City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
## 7                                                            City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
## 8                                         City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
## 9  City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
## 10                     State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
## 11                                      State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
row_out<-apply(tmp, 2, paste, collapse="\t\n")
| name | handle | link |
| --- | --- | --- |
| World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png |
| World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png |
| World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png |
| World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png |
| City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png |
| City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png |
| City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png |
| City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png |
| City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png |
| State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png |
| State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png |
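The same manifest could be rendered directly as a table in the knitted document with `knitr::kable()`, rather than pasting the columns together by hand. A minimal sketch, assuming the `deliverable_manifest` data frame defined above and that the knitr package is installed:

```r
# Render the deliverable manifest as a pipe table in the knitted output.
# col.names relabels the data frame columns to match the headers used above.
library(knitr)
kable(deliverable_manifest,
      col.names = c("name", "handle", "link"),
      format = "pipe")
```

With `results='asis'` in the chunk options, the table renders as formatted output instead of echoed console text.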

Tables

CONCLUSION

Overall, the trend of COVID-19 cases is no longer in the log-linear phase for the world or the U.S. (though some regions, such as MD, remain log-linear). The mortality rate (deaths / confirmed RNA-based cases) is above 1%, with a range depending on region. Mobility is not a strong indicator of caseload (U.S. data).

See table below for detailed breakdown.

| Question | Answer |
| --- | --- |
| What is the effect of social distancing and decreased mobility on case load? | There is no strong apparent effect of decreased mobility (work, grocery, retail) or increased mobility (residence, parks) on the number of confirmed cases, at either the country (U.S.) or state level. California shows one of the better correlations, but the results are mixed. |
| What is the trend in cases and mortality across geographical regions? | Confirmed total cases and mortality are overall log-linear for most countries, with a tail-off beginning for most (including the U.S.). At the state level, NY, NJ, and PA are starting to tail off; MD is still in the log-linear phase. Mortality and case load are highly correlated for NY, NJ, PA, and MD. The mortality rate fluctuates for a given region, but is about 3% overall. |
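The mortality rate quoted above is the crude ratio of deaths to confirmed cases. A minimal sketch of that calculation per region, assuming a data frame `cases_df` with `Province.State`, `Date`, `Total_confirmed_cases`, and `Total_deaths` columns (the data frame name and the deaths column are assumptions; the other names follow the handles used in this analysis):

```r
# Crude mortality rate per region on the most recent date:
# deaths divided by confirmed cases, expressed as a percentage.
library(dplyr)
mortality_by_region <- cases_df %>%
  filter(Date == max(Date)) %>%
  group_by(Province.State) %>%
  summarise(mortality_pct = 100 * sum(Total_deaths) / sum(Total_confirmed_cases))
```

This is a lagging, testing-dependent estimate: deaths trail confirmation by weeks, and confirmed cases undercount infections, so the true infection-fatality rate differs from this ratio.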

END

End: ##—— Wed Jun 24 22:42:15 2020 ——##

Cheatsheet: http://rmarkdown.rstudio.com

Sandbox

# Geographical heatmap!
# install.packages("maps")  # run once if the package is not installed
library(maps)
library(ggplot2)
library(dplyr)

# County outlines for Pennsylvania (variable name kept from the ggplot2 book example)
mi_counties <- map_data("county", "pennsylvania") %>% 
  select(lon = long, lat, group, id = subregion)
head(mi_counties)

ggplot(mi_counties, aes(lon, lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()
mi_counties$cases <- seq_len(nrow(mi_counties))  # placeholder case counts, one per row
name_overlaps(metadata,Corona_Cases.US_state)

tmp<-merge(Corona_Cases.US_state,metadata)
ggplot(filter(tmp,Province.State=="Pennsylvania"), aes(Long, Lat, group = as.factor(City))) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") + 
  coord_quickmap()


ggplot(Corona_Cases.US_state, aes(Long, Lat))+
  geom_polygon(aes(fill = Total_confirmed_cases ), color = "white")+
  scale_fill_viridis_c(option = "C")
dev.off()


require(maps)
require(viridis)

world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white")

head(world_map)
head(Corona_Cases.US_state)
unique(select(world_map, c("region", "group")))

some.eu.countries <- c(
  "US"
)
# Retrieve the map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Compute the centroid as the mean longitude and latitude
# Used as the label coordinate for country names
region.lab.data <- some.eu.maps %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

unique(filter(some.eu.maps,subregion %in% Corona_Cases.US_state$Province.State) %>% select(subregion))
unique(Corona_Cases.US_state$Total_confirmed_cases.log)
ggplot(filter(Corona_Cases.US_state, Date == "2020-04-17"), aes(x = Long, y = Lat)) +
  geom_polygon(aes( fill = Total_confirmed_cases.log))+
  #geom_text(aes(label = region), data = region.lab.data,  size = 3, hjust = 0.5)+
  #scale_fill_viridis_d()+
  #theme_void()+
  theme(legend.position = "none")
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
ggplot(data = world) +
    geom_sf()

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("florida", counties$ID))
counties$area <- as.numeric(st_area(counties))
#install.packages("lwgeom")
class(counties)
head(counties)
ggplot(data = world) +
    # geom_sf(data = Corona_Cases.US_state) +  # would need an sf object, not a plain data.frame
    geom_sf(data = counties, aes(fill = area)) +
   # scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
    coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = FALSE)
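For reference, a minimal working point-map sketch in the same spirit as the attempts above, assuming `Corona_Cases.US_state` has `Long`, `Lat`, `Date`, and `Total_confirmed_cases` columns (names taken from the code in this sandbox):

```r
# Plot state case locations as sized points over a US state basemap.
library(ggplot2)
library(maps)
library(dplyr)

us_map <- map_data("state")
ggplot() +
  geom_polygon(data = us_map, aes(long, lat, group = group),
               fill = "lightgray", colour = "white") +
  geom_point(data = filter(Corona_Cases.US_state, Date == "2020-04-17"),
             aes(Long, Lat, size = Total_confirmed_cases),
             colour = "red", alpha = 0.5) +
  coord_quickmap()
```

Points avoid the polygon-joining problems above, since case coordinates need not align with map polygon vertices.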


head(counties)
tmp<-unique(select(filter(Corona_Cases.US_state,Date=="2020-04-17"),c(Lat,Long,Total_confirmed_cases.per100)))
st_as_sf(map("county", plot = FALSE, fill = TRUE))

# Spatial join: both sides must be sf objects first
st_join(counties, st_as_sf(Corona_Cases.US_state, coords = c("Long", "Lat"), crs = 4326))

library(sf)
library(sp)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
class(nc)


spdf <- SpatialPointsDataFrame(coords = select(Corona_Cases.US_state,c("Lat","Long")), data = Corona_Cases.US_state,
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

head(spdf)
class(spdf)
st_as_sf(spdf)  # convert the sp object to sf (st_cast expects sf/sfc input)

filter(Corona_Cases.US_state.summary,Date=="2020-04-20" & Province.State %in% top_states_modified)

https://stevenbsmith.net